Abstract: This paper presents a new algorithm to track mobile objects in different scene conditions. The main idea
of the proposed tracker includes estimation, multi-feature similarity measures and trajectory filtering. A
feature set (distance, area, shape ratio, color histogram) is defined for each tracked object to search for the best
matching object. The best matching object and the state estimated by the Kalman filter are combined to update
the position and size of the tracked object. However, mobile object trajectories are usually fragmented because
of occlusions and misdetections. Therefore, we also propose a trajectory filtering step, named global tracker,
which aims at removing noisy trajectories and fusing fragmented trajectories belonging to the same mobile
object. The method has been tested with five videos of different scene conditions. Three of them are provided
by the ETISEO benchmarking project (http://www-sop.inria.fr/orion/ETISEO), in which the performance of the
proposed tracker has been compared with seven other tracking algorithms. The advantages of our approach over
existing state-of-the-art ones are: (i) no prior knowledge is required (e.g. no calibration and no
contextual models are needed), (ii) the tracker is more reliable because it combines multiple feature similarities, (iii)
the tracker can perform in different scene conditions: single/several mobile objects, weak/strong illumination,
indoor/outdoor scenes, (iv) a trajectory filtering is defined and applied to improve the tracker performance, (v)
the tracker outperforms many state-of-the-art algorithms.



1 Introduction 

Many different approaches have been proposed to track the motion of mobile
objects in video (A.Yilmaz et al., 2006). However, the performance of a
tracking algorithm always depends on scene conditions such as illumination,
occlusion frequency and movement complexity. Some research aims at improving
the tracking quality by extracting scene information such as path directions
and zones of interest. These elements can help the system to give a better
prediction and decision on object trajectories. For example,
(D.Makris and T.Ellis, 2005) have presented a method to model the paths in a
scene based on detected trajectories. The system uses an unsupervised machine
learning technique to cluster trajectories. A graph is automatically built to
represent the path structure resulting from the learning process. In
(D.P.Chau et al., 2009a), the authors have proposed a global tracker to repair
lost trajectories. The system automatically learns the "lost zones" where
tracked objects usually lose their trajectories and the "found zones" where
tracked objects usually reappear. The system also takes complete trajectories
to learn the common scene paths composed of
<entrance zone, lost zone, found zone>. The learnt paths are then used to fuse
the lost trajectories. This algorithm needs a 3D calibrated environment and a
3D person model as inputs. These two papers obtain good results, but both
require an off-line machine learning process to create rules for improving the
tracking quality.

In order to solve the given problems in mobile object tracking, we propose in
this paper a multiple feature tracker combined with a global tracker. We first
use the Kalman filter to predict the positions of tracked objects. However,
this filter is only an estimator for linear movements, while object movements
in surveillance videos are usually complex. Poor lighting conditions in the
scene also affect the tracking quality. Therefore, in this paper we propose to
use different features to obtain more correct matching links between objects
in a given time window. We also define a global tracker which does not require
3D environment calibration or off-line learning to improve the tracking
quality.

The rest of the paper is organized as follows. The next section presents the
tracking process in detail. Section 3 describes a global tracking algorithm
which aims at filtering out noisy trajectories and fusing fragmented
trajectories. This section also defines when a tracked object ends its
trajectory. Section 4 presents the results of the experimentation and
validation. The last section gives a conclusion as well as future work.



2 Tracking Algorithm 

The proposed tracker takes as its input a list of bounding boxes of the
detected objects at each frame. Pixel values inside these bounding boxes are
also required to compute the color metric. A tracked object at frame t is
represented by a state s = [x, y, l, h] where (x, y) is the center position, l
is the width and h is the height of its 2D bounding box at frame t. In the
tracking process, we follow the three steps of the Kalman filter: estimation,
measurement and correction. However, our contribution focuses on the
measurement step. The estimation step is first performed to estimate the new
state of a tracked object in the current frame. The measurement step is then
performed to search for the detected object in the current frame that is most
similar to each object tracked in the previous frames. The state of the found
object is referred to as the "measured state". The correction step is finally
performed to compute the "corrected state" of the mobile object from the
"estimated state" and the "measured state". This state is considered as the
official state of the tracked object in the current frame. For each detected
object which does not match any tracked object, a new tracked object with the
same position and size is created.

2.1 Estimation of Position and Size 

For each tracked object in the previous frame, the Kalman filter is used to
estimate the new state of the object in the current frame. The Kalman filter
is composed of a set of recursive equations used to model and evaluate linear
object movement. Let $s_{t-1}^{+}$ be the corrected state at instant t - 1;
the estimated state at time t, denoted $s_t^{-}$, is computed as follows:

$$s_t^{-} = \Phi \, s_{t-1}^{+} \qquad (1)$$

where $\Phi$ is the n x n state transition matrix and n is the number of
considered features (n = 4 in our case). Note that in practice $\Phi$ might
change with each time step, but here we assume it is constant. One of the
drawbacks of the Kalman filter is the restrictive assumption of Gaussian
posterior density functions at every time step, as many tracking problems
involve non-linear movement. In order to overcome this limitation, we give a
weight value to determine the reliability of the estimation and of the
measurement (see section 2.3 for details).
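The estimation step of equation (1) can be illustrated with a minimal sketch in plain Python. The paper does not give the entries of the transition matrix, so the identity matrix used below is only an assumption for illustration (it predicts that the state does not change between two frames); the names `estimate_state` and `PHI_IDENTITY` are hypothetical.

```python
from typing import List

State = List[float]  # [x, y, l, h]: center position, width, height

# Assumption for illustration only: the paper does not specify the entries
# of the transition matrix. The identity predicts an unchanged state.
PHI_IDENTITY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def estimate_state(phi: List[List[float]], corrected_prev: State) -> State:
    """Equation (1): multiply the transition matrix by the previous corrected state."""
    n = len(corrected_prev)
    if len(phi) != n or any(len(row) != n for row in phi):
        raise ValueError("phi must be an n x n matrix matching the state size")
    return [sum(phi[r][c] * corrected_prev[c] for c in range(n)) for r in range(n)]
```

Any other constant n x n matrix can be passed in place of `PHI_IDENTITY`; the function itself only performs the matrix-vector product.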

2.2 Measurement 

This is our main contribution in the tracking process. For each tracked object
in the previous frame, the goal of this step is to search for the best matched
object in the current frame. In tracking problems, the execution time of the
tracking algorithm is very important to ensure a real-time system. Therefore,
in this paper we propose to use a set of four features to compute the
similarity between two objects: distance, area, shape ratio and color
histogram. The computation of all of these features is not time consuming, so
the proposed tracker can be executed in real time. Because all measurements
are computed in the 2D image space, our proposed method does not require scene
calibration information. For each feature i (i = 1..4), we define a local
similarity $LS_i$ in the interval [0, 1] to quantify the object similarity for
feature i. A global similarity is defined as a combination of these local
similarities. The detected object with the highest global similarity is chosen
for the correction step.

2.2.1 Distance Similarity 

The distance between two objects is computed as the distance between the two
corresponding object positions. Let $D_{max}$ be the maximal possible
displacement of a mobile object in one video frame and d be the distance
between the two considered objects; we define a local similarity $LS_1$
between these two objects using the distance feature as follows:

$$LS_1 = \max\left(0,\; 1 - \frac{d}{D_{max} \cdot m}\right) \qquad (2)$$

where m is the temporal difference (in frames) between the two considered
objects.

In a 3D calibrated environment, a single value of $D_{max}$ can be set for the
whole scene. However, this value should not be unique in a 2D scene: the
threshold changes according to the distance between the considered objects and
the camera position. The nearer an object is to the camera, the larger its
displacement in the image. In order to overcome this limitation, we set the
$D_{max}$ value to half the length of the bounding box diagonal of the
considered tracked object.
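As a minimal sketch of equation (2), the distance similarity can be written as follows. The Euclidean distance between the two center positions is an assumption (the text only says "distance"), and the function names are hypothetical.

```python
import math
from typing import Tuple


def half_diagonal(width: float, height: float) -> float:
    """D_max as chosen in the text: half the bounding box diagonal of the tracked object."""
    return 0.5 * math.hypot(width, height)


def distance_similarity(
    pos_a: Tuple[float, float],
    pos_b: Tuple[float, float],
    d_max: float,
    frame_gap: int = 1,
) -> float:
    """Equation (2): LS_1 = max(0, 1 - d / (D_max * m)).

    Assumes d is the Euclidean distance between the two center positions
    and frame_gap is the temporal difference m in frames.
    """
    if d_max <= 0 or frame_gap <= 0:
        raise ValueError("d_max and frame_gap must be positive")
    d = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    return max(0.0, 1.0 - d / (d_max * frame_gap))
```

For a tracked object with a 6 x 8 bounding box, `half_diagonal(6, 8)` gives a $D_{max}$ of 5 pixels, so a detection 5 pixels away in the next frame already has a similarity of 0.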

2.2.2 Area Similarity 

The area of an object i is calculated as $W_i H_i$, where $W_i$ and $H_i$ are
the 2D width and height of the object respectively. A local similarity $LS_2$
between the areas of two objects i and j is defined by:

$$LS_2 = \frac{\min(W_i H_i,\; W_j H_j)}{\max(W_i H_i,\; W_j H_j)} \qquad (3)$$

2.2.3 Shape Ratio Similarity



The shape ratio of an object i is calculated as $W_i / H_i$ (where $W_i$ and
$H_i$ are defined in section 2.2.2). A local similarity $LS_3$ between the
shape ratios of two objects i and j is defined as follows:

$$LS_3 = \frac{\min(W_i / H_i,\; W_j / H_j)}{\max(W_i / H_i,\; W_j / H_j)} \qquad (4)$$

2.2.4 Color Histogram Similarity



In this work, the color histogram of a mobile object is defined as the
histogram of pixel counts inside its bounding box. Other color features (e.g.
MSER) can be used, but this one has given satisfying results. We define a
local similarity $LS_4$ between two objects i and j for the color feature as
follows:

$$LS_4 = \frac{\sum_{k=1}^{n} rate_k}{n} \qquad (5)$$

where n is a parameter representing the number of histogram bins, n = 1..768
(the value 768 is the result of the product 256 x 3), and $rate_k$ is computed
as follows:

$$rate_k = \frac{\min(H_i(k),\; H_j(k))}{\max(H_i(k),\; H_j(k))} \qquad (6)$$

$H_i(k)$ and $H_j(k)$ are respectively the numbers of pixels of objects i and
j at bin k. There are several ways to compute the difference between two
histograms; in this work we choose the ratio computation for each histogram
bin to obtain a value $rate_k$ normalised in the interval [0, 1].
Consequently, the $LS_4$ value also varies in this interval.
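The area, shape ratio and color histogram similarities all rely on the same min/max ratio, which the following sketch makes explicit. It assumes that $LS_4$ is the mean of the per-bin ratios, as the normalisation to [0, 1] suggests. How to treat a bin that is empty in both histograms is not stated in the text; counting it as a perfect match is an assumption, and all function names are hypothetical.

```python
from typing import Sequence


def ratio_similarity(a: float, b: float) -> float:
    """min(a, b) / max(a, b) for non-negative values, in [0, 1].

    Assumption: two zero values are treated as identical (similarity 1).
    """
    largest = max(a, b)
    if largest == 0:
        return 1.0
    return min(a, b) / largest


def area_similarity(w_i: float, h_i: float, w_j: float, h_j: float) -> float:
    """LS_2: ratio of the two bounding box areas."""
    return ratio_similarity(w_i * h_i, w_j * h_j)


def shape_ratio_similarity(w_i: float, h_i: float, w_j: float, h_j: float) -> float:
    """LS_3: ratio of the two width/height shape ratios."""
    if h_i <= 0 or h_j <= 0:
        raise ValueError("heights must be positive")
    return ratio_similarity(w_i / h_i, w_j / h_j)


def color_histogram_similarity(hist_i: Sequence[float], hist_j: Sequence[float]) -> float:
    """LS_4: mean over all bins of the per-bin ratio rate_k."""
    if len(hist_i) != len(hist_j) or len(hist_i) == 0:
        raise ValueError("histograms must be non-empty and have the same number of bins")
    rates = [ratio_similarity(a, b) for a, b in zip(hist_i, hist_j)]
    return sum(rates) / len(rates)
```

Because each ratio lies in [0, 1], the three similarities are directly comparable with the distance similarity without further normalisation.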



2.2.5 Global Similarity

A detected object can show some size variations compared to previous frames
because of detection errors, or some color variations because of illumination
changes, but its maximum speed cannot exceed a determined value. Therefore, in
our work, the global similarity gives priority to the distance feature over
the other features in order to decrease the number of false object matching
links.

$$GS = \begin{cases} \dfrac{\sum_{i=1}^{4} w_i \, LS_i}{\sum_{i=1}^{4} w_i} & \text{if } LS_1 > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where GS is the global similarity, $w_i$ is the weight (i.e. reliability) of
feature i and $LS_i$ is the local similarity of feature i. The detected object
with the highest global similarity value GS is chosen as the matched object
if:

$$GS \geq T_1 \qquad (8)$$

where $T_1$ is a predefined threshold. The higher the value of $T_1$, the more
correct the established matching links are, but too high a value of $T_1$ can
lose matching links in complex environments (e.g. poor lighting conditions,
occlusion). The state of this object (including its position and its bounding
box size) is called the "measured state". At a time instant t, if a tracked
object cannot find its matched object, the measured state $MS_t$ is set to
$\emptyset$. In the experimentation of this work, we suppose that all feature
weights $w_i$ have the same value.

2.3 Correction

Thanks to the estimated and measured states, we can update the position and
size of the tracked object by computing the corrected state as follows:

$$CS_t = \begin{cases} w \, MS_t + (1 - w) \, ES_t & \text{if } MS_t \neq \emptyset \\ CS_{t-1} & \text{otherwise} \end{cases} \qquad (9)$$

where $CS_t$, $MS_t$ and $ES_t$ are the corrected state, measured state and
estimated state of the tracked object at time instant t respectively, and w is
the weight of the measured state. If the measured state is not found, the
corrected state is set equal to the corrected state in the previous frame.
While the estimated state is only the result of a simple linear estimator, the
measurement step is carried out by considering four different features. We
thus set a high value for w (w = 0.7) in our experimentation.
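The combination of the local similarities and the correction of section 2.3 can be sketched as below. The sketch assumes that the global similarity is the weighted mean of the local similarities whenever the distance similarity is positive, and that a missing measurement keeps the corrected state of the previous frame, as the text of section 2.3 describes. The function names and the use of `None` for a missing measured state are hypothetical choices.

```python
from typing import List, Optional, Sequence


def global_similarity(local_sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the local similarities, forced to 0 when LS_1 is 0.

    local_sims[0] must be the distance similarity LS_1, which has priority.
    """
    if len(local_sims) != len(weights) or len(local_sims) == 0:
        raise ValueError("local_sims and weights must have the same non-zero length")
    if local_sims[0] <= 0:
        return 0.0
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * ls for w, ls in zip(weights, local_sims)) / total_weight


def correct_state(
    measured: Optional[Sequence[float]],
    estimated: Sequence[float],
    previous_corrected: Sequence[float],
    w: float = 0.7,
) -> List[float]:
    """Blend measured and estimated states; keep the previous state if no match."""
    if measured is None:  # no detected object passed the threshold T_1
        return list(previous_corrected)
    return [w * m + (1.0 - w) * e for m, e in zip(measured, estimated)]
```

With equal weights, as in the experimentation, `global_similarity` reduces to the plain average of the four local similarities, except that a detection farther away than the allowed displacement is rejected regardless of its appearance.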



3 Global Tracking Algorithm 

Global tracking aims at fusing the fragmented trajectories belonging to the
same mobile object and removing noisy trajectories. As mentioned in section
2.3, if a tracked object cannot find the corresponding detected object, its
corrected state is kept equal to the corrected state of the previous frame.
The object then turns into a "waiting state". The tracked object goes out of
the "waiting state" when it finds a matched object. A tracked object can enter
and leave the "waiting state" many times during its life. This waiting step
lets a non-updated track live for some frames when no correspondence is found.
The system can thus track the object motion completely even when the object is
sometimes not detected or is detected incorrectly. This prevents the mobile
object trajectories from being fragmented. However, the "waiting state" can
cause an error when the corresponding mobile object leaves the scene
definitively. Therefore, we propose a rule to decide when a tracked object
ends its life, which also avoids maintaining the "waiting state" for too long.
A more reliable tracked object is kept longer in the "waiting state". In our
work, the reliability of a tracked object is directly proportional to the
number of times this object has found matched objects: the greater the number
of matched objects, the greater the reliability of the tracked object. Let the
Id of a frame be the order of this frame in the processed video sequence; a
tracked object ends if:

$$F_l < F_c - \min(N_r, T_2) \qquad (10)$$

where $F_l$ is the latest frame Id where this tracked object found a matched
object (i.e. the frame Id before entering the "waiting state"), $F_c$ is the
current frame Id, $N_r$ is the number of frames in which this tracked object
was matched with a detected object, and $T_2$ is a parameter giving the
maximum number of frames that the "waiting state" of a tracked object can
last. With this calculation method, a tracked object that has found a greater
number of matched objects is kept in the "waiting state" for a longer time,
but its "waiting state" time never exceeds $T_2$. The higher the value of
$T_2$, the higher the probability of finding lost objects, but this can
decrease the correctness of the fusion process.
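The termination rule of equation (10) reduces to a single comparison, sketched here with a hypothetical function name. The default value of 20 frames for $T_2$ is the one used in the experimentation of section 4.

```python
def track_has_ended(
    last_matched_frame: int,
    current_frame: int,
    matched_count: int,
    t2: int = 20,
) -> bool:
    """Equation (10): the track ends if F_l < F_c - min(N_r, T_2).

    last_matched_frame is F_l, current_frame is F_c and matched_count is N_r.
    """
    allowed_waiting = min(matched_count, t2)
    return last_matched_frame < current_frame - allowed_waiting
```

For example, a track matched only 5 times and last matched at frame 100 survives until frame 105 and ends at frame 106, whereas a track matched 50 times is kept until frame 120 because its waiting time is capped by $T_2$.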

We also propose a set of rules to detect noisy trajectories. Noise usually
appears when wrong detection or misclassification (e.g. due to low image
quality) occurs. A static object or some image regions can be detected as a
mobile object. However, noise usually appears in only a few frames or does not
really move (it stays around a fixed position). We thus propose to use
temporal and spatial filters to remove it. A trajectory is composed of objects
throughout time, so it is unreliable if it does not contain enough objects and
usually lives in the "waiting state". Therefore we define a temporal
threshold: when the "waiting state" time exceeds it, the corresponding
trajectory is considered as noise. Also, if a new trajectory appears, the
system cannot determine immediately whether it is noise or not. The global
tracker has enough information to filter it out only some frames after its
appearance. Consequently, a trajectory that satisfies one of the following
conditions is considered as noise:

$$T < T_3 \qquad (11)$$

$$(d_{max} < T_4) \text{ and } (T \geq T_3) \qquad (12)$$

$$\left(\frac{T_w}{T} \geq T_5\right) \text{ and } (T \geq T_3) \qquad (13)$$

where T is the time length (number of frames) of the considered trajectory
("waiting state" time included); $d_{max}$ is the maximum spatial length of
this trajectory; $T_w$ is the total time of "waiting state" during the life of
the considered trajectory; $T_3$, $T_4$ and $T_5$ are predefined thresholds.
While $T_4$ is a spatial filter threshold, $T_3$ and $T_5$ can be considered
as temporal filter thresholds to remove noisy trajectories. Condition (11) is
only examined for trajectories which end their life according to equation
(10).
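The three noise conditions can be gathered in one predicate, sketched below under two assumptions: the comparisons on the trajectory length and on the waiting ratio are read as "greater than or equal", and the short-trajectory condition (11) is applied only once the trajectory has ended. The function name and argument names are hypothetical; the default thresholds are the values used in section 4 (20 frames, 5 pixels, 40%).

```python
def is_noisy_trajectory(
    length: int,
    max_spatial_length: float,
    waiting_time: int,
    has_ended: bool,
    t3: int = 20,
    t4: float = 5.0,
    t5: float = 0.4,
) -> bool:
    """Return True if the trajectory satisfies one of the conditions (11) to (13).

    length is T (frames, waiting time included), max_spatial_length is d_max
    (pixels) and waiting_time is T_w (frames spent in the "waiting state").
    """
    if length < t3:
        # Condition (11): too short, but only decidable once the track has ended.
        return has_ended
    if max_spatial_length < t4:
        # Condition (12): long enough but it never really moved.
        return True
    # Condition (13): long enough but mostly spent in the "waiting state".
    return length > 0 and waiting_time / length >= t5
```

A trajectory that is still alive and shorter than $T_3$ is never reported as noise by this sketch, which reflects the remark that the global tracker needs some frames before it can decide.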



4 Experimentation and Validation 

Tracker evaluation methods can be classified into two principal approaches:
off-line evaluation using ground truth data (C.J.Needham and R.D.Boyle, 2003)
and on-line evaluation without ground truth data (D.P.Chau et al., 2009b). In
order to compare our tracker performance with other ones, we use the tracking
evaluation metrics defined in the ETISEO benchmarking project
(A.T.Nghiem et al., 2007), which belong to the first approach. The first
tracking evaluation metric M1, the "tracking time" metric, measures the
percentage of time during which a reference object (ground truth data) is
tracked. The second metric M2, "object ID persistence", computes throughout
time how many tracked objects are associated with one reference object. The
third metric M3, "object ID confusion", computes the number of reference
object IDs per tracked object. These metrics must be used together to obtain a
complete tracker evaluation. Therefore, we also define a tracking metric M as
the average value of these three tracking metrics. All four metric values are
defined in the interval [0, 1]. The higher the metric value, the better the
tracking algorithm performance.
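The overall metric M is a plain average, which the short sketch below states explicitly. The function name is hypothetical; the input values in the usage note are the three metric values reported for the TRECVid sequence later in this section.

```python
def overall_tracking_metric(m1: float, m2: float, m3: float) -> float:
    """M: average of tracking time, object ID persistence and object ID confusion."""
    for value in (m1, m2, m3):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each metric must lie in the interval [0, 1]")
    return (m1 + m2 + m3) / 3.0
```

With M1 = 0.71, M2 = 0.90 and M3 = 0.85, the sum is 2.46 and the average is 0.82.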

In this experimentation, we use the people detection algorithm based on the
HOG descriptor of the OpenCV library (http://opencv.willowgarage.com/wiki/).
Therefore we focus the experimentation on sequences containing people
movements. However, the principle of the proposed tracking algorithm does not
depend on the type of tracked object.

Figure 1: Illustration of tested videos: a) ETI-VS1-BE-18-C4 b) ETI-VS1-RD-16-C4 c) ETI-VS1-MO-7-C1 d) Gerhome
e) TRECVid. The colors represent the bounding boxes and trajectories of tracked people.

We have tested our tracker with five video sequences. The first three videos
are selected from the ETISEO data in order to compare the performance of the
proposed tracker with that of other teams. The last two videos are extracted
from different projects so that the proposed tracker can be tested with more
scene conditions. All five videos are tested with the following parameter
values: n = 96 bins (formula (5)), T1 = 0.8 (formula (8)), T2 = 20 frames
(formula (10)), T3 = 20 frames (formula (11)), T4 = 5 pixels (formula (12))
and T5 = 40% (formula (13)).

The first tested ETISEO video shows a building entrance, denoted
ETI-VS1-BE-18-C4. In this sequence, there is only one person moving, but the
illumination and contrast levels are low (see image a of Figure 1). The second
ETISEO video shows a road with strong illumination, denoted ETI-VS1-RD-16-C4
(see image b of Figure 1). There are walkers, bicyclists and cars moving on
the road. The third video shows an underground station, denoted
ETI-VS1-MO-7-C1, where there are many complex people movements (see image c of
Figure 1). The illumination and contrast in this sequence are very poor.

In this experimentation, tracker results from seven different ETISEO teams are
presented: 1, 8, 11, 12, 17, 22, 23. Table 1 presents the performance results
of our tracker and of these seven teams on the three ETISEO sequences.
Although each tested video has its own complexity, the tracking evaluation
metrics of the proposed tracker get the highest values in most cases compared
to the other teams. In the second video, the tracking time of our tracker is
low (M1 = 0.36) because, as mentioned above, we only use the people detector
and so the system usually fails to detect cars.

The fourth video sequence has been provided by the Gerhome project (see image
d of Figure 1). The objective of this project is to enhance the autonomy of
elderly people at home by using intelligent technologies for house automation.
In this sequence, there is only one person moving, but the video is quite long
(13 minutes 40 seconds). The tracking results are given in the second column
of Table 2. Although the sequence is quite long, the proposed tracker can
follow the person's movement for most of the time, from frame 1 to frame 8807
(M1 = 0.86). After that, there are four moments when the detection algorithm
cannot detect the person for an interval of over 20 frames (over the value of
T2). Therefore the value of metric M2 for this video sequence is only equal to
0.2.

The last tested sequence concerns the movements of people in an airport. This
sequence is provided by the TREC Video Retrieval Evaluation (TRECVid)
(A.Smeaton et al., 2006). People tracking in this sequence is a very hard task
because there is always a great number of movements in the scene and
occlusions happen frequently (see image e of Figure 1). Despite these
difficulties, the proposed tracker obtains high values for all three tracking
evaluation metrics: M1 = 0.71, M2 = 0.90 and M3 = 0.85 (see the third column
of Table 2).

The average processing speed of the proposed tracking algorithm is high for
all considered sequences. In the most complicated sequence, which contains
crowds (TRECVid sequence), this value is equal to 20 fps. In the other video
sequences, the average processing speed of the tracking task is greater than
50 fps. This allows the whole tracking framework (including video acquisition,
detection and tracking tasks) to run as a real-time system.



5 Conclusion 

Although much research aims at resolving the problems raised by the tracking
process, such as misdetection and occlusion, there is still no robust tracker
which performs well in different scene conditions. This paper has presented a
tracking algorithm which is combined with a global tracker to increase the
robustness of the tracking process. The proposed approach has been tested and
validated on five real video sequences. The experimental results show that the
proposed tracker obtains good tracking results in many different scenes,
although each tested scene has its own complexity. Our tracker also obtains
the best performance on the tested ETISEO videos compared to the other
trackers evaluated in this project. The average processing speed of the
proposed tracking algorithm is high. However, some drawbacks still exist in
this approach: the features used are simple, and more complex features (e.g.
color covariance) are needed to obtain more reliable matching links between
objects. In future work, we also propose an on-line automatic learning of the
detected trajectories to improve the global tracker quality.

Table 1: Summary of tracking results for ETISEO videos.
N: number of video frames, F: video frame rate, s: average
processing speed of the tracking task (frames/second), not
taking into account the detection process. The highest values
are printed in bold.

Table 2: Tracking results for Gerhome and TRECVid
videos. s denotes the average processing speed of the tracking
task (frames/second).